Program: Scan
Version: 1.0 5/17/92
Utility: Searches twice as fast on hard drives and five times faster
in ram than the best search programs currently available.
Option to selectively scan (by wildcard) internal LZH and LHA files.
Supports searching for multiple patterns simultaneously
with little speed degradation.
Option to output whole article when a match is found.
Extensive wildcard support(#?,*,?,[],[^],[-],+,|,&,..).
Optional inverted pattern matching.
Recursive directory scanning.
Support for \x?? and \? in patterns and article separator.
Line search highlights matching words with selectable color.
Freeware.
Tribute: Eternal praise to Jesus for saving my soul and for the wonders
of God's creation.
Legal: Copyright © 1991, 1992 by Walter Rothe
This program is freely distributable, but copyrighted by me. This
means that you can copy it freely as long as you don't ask for any
more money than a nominal fee for copying. This program may be
placed on Public Domain disks, like Fred Fish's library. To
distribute this program you must include the program,
documentation, and test files in their original unmodified form.
This does not preclude compression by archiving programs like
lharc. This program cannot be used for commercial purposes
without written permission from the author. The author can not
be made responsible for any damage which is caused by using
this program. Derivative works must be released with source along
with the executable or provisions made to provide the user source,
if requested. Uploading source to a major US bulletin board system
within 6 months of the time of the request satisfies this
requirement. Improvements to this program must include this file,
with any changes in functionality documented, keeping the
"Tribute", "Legal", "Author", and "Contact" sections unchanged
and including an updated "scan.revisions" and "scan.todo" file.
The example configuration files should also be included.
Command format:
Scan -[nprt] -[hColor] -[lNumLines] -[wWinSize] -[oOutFile]
-[zLHAWild] SrchFile(s) Pattern
OR
Scan -f[CnfgFile] -[oOutFile] -[pr] -[zLHAWild] SrchFile(s)
OR
Scan -a[ipr] -[cColumn] -[sArtSep] -[oOutFile] -[wWinSize]
-[zLHAWild] SrchFile(s) Pattern
OR
Scan -[vx]
Names: Color - A two digit number, where the 1st number indicates
the color the matching word is highlighted with,
and the second is the color the filenames of the
files being searched are highlighted with. Currently
limited to 0-9.
NumLines - Number of lines of context information printed
around match.
WinSize - Number of bytes in window. Default 16K bytes.
Modulus( WinSize, 4 ) must be 0. There are
three buffers, each WinSize long that are swapped.
Larger size windows usually increase the speed,
except when handling large numbers of small files.
Also, the larger the window, the more context
information can be printed. Context info is limited
to what's in the present and previous buffer. A large
article may need a large window to be fully printed
out. Currently the WinSize is forced to 16K bytes
whenever the -z option is in effect.
OutFile - Pathname of file output will be put into.
LHAWild - Wildcard pattern that is used to determine which
internal LHA files are scanned. Only "#?", "*", and
"?" wildcards are permitted here. Note that the full
internal filename must be matched. Any directories
must be included. Some shells expand wildcards on
the command line so you may need to enclose the
option in quotes, e.g. "-z*.c"
SrchFile(s) - Pathname of file(s) to be searched. Only "#?", "*",
and "?" wildcards are permitted here. For recursive dir
scans, you need only specify the directory pathname.
You can optionally add a "/" or "/*.h" to the end of
the directory pathname. The command line is limited
to 255 chars so if you specify a wildcard pattern and
the pathnames of the matching files exceed 255 chars,
the last files won't be scanned. To get around this,
enclose the wildcard in quotes, e.g. "*.l??"
Pattern - Pattern to search for. This can include the wildcards
"#?", "*", "?", "&", "|", "+", and "..". Refer to
the following section on pattern matching. Note that
you may need to enclose some of the wildcards in
quotes to keep the shell from expanding them. Also,
there must be at least 2 consecutive unique non-wildcard
characters in each pattern between the
"|" wildcards. Pattern is case insensitive. For
example "sale..d*paint[3i]|paint&prog"
CnfgFile - Pathname of file containing article separator,
column for article separator, inversion flag,
window size, and search pattern with each on a
separate line, starting in column 1. Note that each
pattern between "|"'s appears on a separate line
without the "|". There is a maximum of 125 of these.
If the "-f" option is not followed with a name,
then file S:scan.config is used.
ArtSep - Article separator. Defaults to "\nArticle". Note
that a "\" in the article separator has the same
meaning as that in the pattern matching algorithm
explained below. An article separator must be 2 or
more characters long. For example, "\n\n" is a
valid separator and causes a new article to be
started with each blank line. Note that some shells
require you to specify this as "\\n\\n".
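As a rough illustration of how an article separator partitions a file, and of how the -a and -i options then select articles, here is a minimal Python sketch. It is not Scan's implementation: the function names split_articles and scan_articles are made up for this example, a plain substring test stands in for the real pattern matcher, and the -c column check is left out.

```python
def split_articles(text, separator="\nArticle"):
    """Split text into articles; each separator starts a new article."""
    if len(separator) < 2:
        raise ValueError("article separator must be 2 or more characters long")
    articles = []
    start = 0
    pos = text.find(separator)
    while pos != -1:
        if pos > start:
            articles.append(text[start:pos])
        start = pos
        # Continue looking after the separator that was just found.
        pos = text.find(separator, start + len(separator))
    articles.append(text[start:])
    return articles


def scan_articles(text, word, separator="\nArticle", invert=False):
    """Return the articles containing word (case insensitive).

    With invert=True, return the articles that do NOT contain it,
    in the spirit of the -i option.
    """
    word = word.lower()
    return [article for article in split_articles(text, separator)
            if (word in article.lower()) != invert]
```

Calling scan_articles(text, "amiga") keeps only the articles that mention the word, while passing invert=True keeps every other article instead.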
Options:
-a Article scan. Prints out all articles with matches.
-cColumn Column article separator must be in (1..?).
-fCnfgFile Get parms from config file.
-f Get parms from s:scan.config.
-hxy Highlight match with color x and pathnames with color y.
-i Invert matching so nonmatching articles are printed.
-lxx Line search with xx lines around target printed.
-n Print line numbers with matched text(slower).
-oOutFile Send output to file.
-p Always print file pathnames scanned.
-r Recursively scan down directories.
-sArtSep Article separator (def \nArticle).
-t Truncate output to window width. Only works with -n.
-v Print version number. Other options nulled.
-wWinSize Window size(def 16384 bytes). Mod(size,4) must be 0.
-x Print out more help information. Nulls other options.
-zWildPat Enable decompression of .lzh and .lha files, scanning only
internal files matching WildPat.
-z Enable decompression of all .lzh and .lha files.
Pattern matching:
? Matches any single character except
[chars] Match any characters within the brackets, e.g. [abcxyz]
[c1-c2] Match any character from c1 to c2, e.g. [a-x]
[^chars] Match any characters not within the brackets.
\xYY Matches hex number YY as a character. Note that back
slashes are preprocessed by some shells. You may need
to put two slashes back to back to prevent this.
\Y Matches the standard C escape sequence Y
\YYY Matches the decimal number YYY as a character
| Finding the pattern on the left or right causes a match
+ Same as |
#? Pattern on left and right must both match and be in the
same word. Match on left must come before one on right.
* Same as #?
& Pattern on left and right must both match and be in the
same sentence. Match on left must come before match on
the right. A sentence is delimited by:
a) period followed by a space or line feed
b) a maximum of OVERLAP(512) characters
c) two newline chars with no chars but ">" between them
d) start or end of article
e) newline before a colon
.. Pattern on left and right must both match and be in the
same article. Order of left and right matches is not
important. This is faster than "&". This wildcard is
only used during article scans(-a or -f), not line
scans.
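The difference between the "|", "*", "&", and ".." operators can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not how Scan works internally: the function names are invented, both operands are plain strings, a word is taken to be a run of non-whitespace characters, and only sentence rule (a), a period followed by a space or line feed, is modeled.

```python
import re


def match_or(text, left, right):
    """"|" or "+": either pattern may appear anywhere."""
    t = text.lower()
    return left.lower() in t or right.lower() in t


def match_word(text, left, right):
    """"*" or "#?": left then right inside one whitespace-delimited word."""
    pattern = re.escape(left.lower()) + r"\S*" + re.escape(right.lower())
    return re.search(pattern, text.lower()) is not None


def match_sentence(text, left, right):
    """"&": left then right inside one sentence (rule (a) only)."""
    left, right = left.lower(), right.lower()
    for sentence in re.split(r"\.[ \n]", text.lower()):
        i = sentence.find(left)
        if i != -1 and sentence.find(right, i + len(left)) != -1:
            return True
    return False


def match_article(text, left, right):
    """"..": both patterns somewhere in the article, in any order."""
    t = text.lower()
    return left.lower() in t and right.lower() in t
```

For the article "Amiga sales rose. The commercial market grew.", match_article finds both words regardless of order, while match_sentence reports no match because the words sit in different sentences.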
Config format:
line 1 Article separator
line 2 Column article separator must be in. 0 -> ignore
line 3 Invert match flag. 1 -> invert match. 0 -> normal
line 4 Window size in bytes. Mod( size, 4 ) must be 0.
line 5..129 Search patterns. There is an implicit "|" between
each line. There may be fewer than 125 pattern lines
if there are any ".." wildcards on a line. The number
of lines is reduced directly by the number of ".."'s.
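To make the layout above concrete, the following Python sketch reads the lines of such a config file and applies the stated limits. It is a hypothetical reader written for illustration, not code from Scan: the function name parse_scan_config and the returned dictionary keys are assumptions, and blank pattern lines are simply skipped.

```python
def parse_scan_config(lines):
    """Parse the lines of a Scan-style config file into a dict."""
    if len(lines) < 5:
        raise ValueError("need separator, column, invert flag, "
                         "window size, and at least one pattern line")
    window = int(lines[3])
    if window % 4 != 0:
        raise ValueError("window size must be a multiple of 4")
    patterns = [line for line in lines[4:] if line]
    # Each ".." wildcard uses up one of the 125 available pattern lines.
    limit = 125 - sum(p.count("..") for p in patterns)
    if len(patterns) > limit:
        raise ValueError("too many pattern lines")
    return {
        "separator": lines[0],        # kept raw; escapes are not decoded here
        "column": int(lines[1]),      # 0 means the column is ignored
        "invert": lines[2].strip() == "1",
        "window": window,
        "patterns": patterns,         # implicit "|" between entries
    }
```

The lines argument could come from reading a config file and calling splitlines() on its contents.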
Background:
I wrote "scan" to help minimize the time I spend scanning the
very large (megabyte) USENET proceedings I download weekly. This
program scans a file or set of files looking for strings matching
user specified pattern(s). It supports the traditional #?, *, ?, and
[] wildcards, but includes three new ones: "&", which is similar to
"*" but works on sentences instead of words; "..", which is similar
to "&" but works on articles and is order independent; and "|",
which matches when either of two patterns is found. "Scan"
can print out the entire text of an article if a match is found
anywhere in the article. It can search recursively down directories
and do inverse pattern matching where only articles that don't match
the pattern are printed out. Due to the size of Usenet proceedings,
it's desirable to keep them archived in compressed form. Scan
supports .LZH and .LHA formats with .LHA being significantly faster
and more dense. You can even selectively scan user specified
internal .LZH and .LHA files using a wildcard pattern. Finally, up
to 125 patterns can be scanned for simultaneously, with minimal
speed degradation.
The fastest search programs I've seen to date are "zgrep" and the
"csh" search command. "Zgrep" has the edge when run out of ram,
but "csh" does better on hard disk searches. On hard drives, "Scan"
searches twice as fast as "csh" and 3 times faster than "zgrep".
In ram, it searches 5 times faster than "zgrep" and 15 times faster
than "csh". Search time is about the same for all 3 when
run from floppy.
Algorithm:
A preprocessor selects the least repetitive two character
sub-pattern from each major term of the pattern. An even and an
odd two character subpattern is selected. This allows 16 bits
to be processed at a time in the inner loop. These two character
subpatterns are used to do a parallel Boyer-Moore type search.
If a match is found with the two char subpattern, the rest of
the pattern is checked. If the full major term matches, a flag is
set and other flags examined to see if the full minterm matches.
If so, another flag is set to cause the article to be printed out.
A triple buffer approach is used with Matt Dillon's asynchronous
I/O to help speed file reads. Thanks Matt. A special two character
end of buffer subpattern is put at the end of the buffer so EOF
won't have to be checked for after each pattern check.
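The two ideas that are easiest to show in isolation are the two character subpattern filter and the end of buffer sentinel. The Python sketch below illustrates both under stated assumptions and is not the author's code: a plain linear scan stands in for the parallel Boyer-Moore type search, only one literal pattern is handled, "least repetitive" is approximated by taking the first adjacent pair of differing characters, and the names pick_pair and find_all are invented.

```python
def pick_pair(pattern):
    """Return (offset, pair) for the first adjacent pair of differing chars."""
    for i in range(len(pattern) - 1):
        if pattern[i] != pattern[i + 1]:
            return i, pattern[i:i + 2]
    raise ValueError("no unique two-character subpattern")


def find_all(buffer, pattern):
    """Return start offsets of pattern in buffer (case insensitive)."""
    buffer = buffer.lower()
    pattern = pattern.lower()
    offset, pair = pick_pair(pattern)
    end = len(buffer)
    # Sentinel: appending the pair guarantees the inner loop always stops,
    # so it never has to test for the end of the buffer.
    guarded = buffer + pair
    hits = []
    i = 0
    while True:
        while guarded[i:i + 2] != pair:
            i += 1
        if i > end - 2:
            break  # the pair found is the sentinel, not real data
        start = i - offset
        # Only now is the rest of the pattern checked.
        if start >= 0 and buffer[start:start + len(pattern)] == pattern:
            hits.append(start)
        i += 1
    return hits
```

Because the two characters of the pair differ, a match can never straddle the boundary between the real data and the sentinel, which is what makes the single check after the inner loop sufficient.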
Examples:
Searching for sentences with the words "Amiga" and "commercial"
in them is specified with:
amiga&commercial | commercial&amiga
If "Amiga" and "commercial" don't have to be in the same sentence,
then it can be done with:
amiga..commercial
To find the paragraphs with the words "truth" and "life" in them
within the Gospel of John, using the archived New Testament on BIX,
with the "csh" shell, and printing out the internal book names as
they are scanned:
scan -aps\\n\\n "-z*john*" biblenew.lha truth..life
This takes about 3 seconds off a harddrive on an A3000.
Note also that to setup an alias in "csh" requires additional
back slashes:
alias scankjv "*aa*ab*ac scan -as\\\\n\\\\n"
Note the *aa*ab*ac in the above alias. It is the way you tell "csh"
not to expand wildcards before passing them to "scan". Up to 3
arguments, in this case, will not be expanded. The only exception
to this is the "|" symbol. Since "csh" uses this for piping, you
have to put quotes around any argument with a "|" in it. You can
use "+" instead of "|" without quotes.
To find all occurrences of "faith" and "healed" in the same sentence
in the entire New Testament, only printing out the names of the
books with the matches, highlighting the word that matched, and
printing 3 lines of context on both sides of the match:
scan -l3 -z biblenew.* faith&healed+healed&faith
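What "3 lines of context on both sides of the match" amounts to can be sketched in Python as follows. This is an illustration only, with an invented function name and a plain substring test instead of Scan's pattern matcher; highlighting and the window size limit on context are not modeled.

```python
def lines_with_context(text, word, num_lines=3):
    """Return (line_number, line) pairs for matches plus surrounding lines."""
    lines = text.splitlines()
    word = word.lower()
    keep = set()
    for i, line in enumerate(lines):
        if word in line.lower():
            first = max(0, i - num_lines)
            last = min(len(lines), i + num_lines + 1)
            keep.update(range(first, last))
    # Line numbers are 1-based; overlapping contexts are merged by the set.
    return [(i + 1, lines[i]) for i in sorted(keep)]
```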
The Amiga style wildcard "#?" may be used interchangeably with the
"*" wildcard.
Hints: Search patterns must be at least 2 characters long.
There must be at least two consecutive unique characters
within each major term of the pattern. A major term is delimited
by a "|", "+", or "..". The program will tell you if it cannot
find unique subpatterns.
There are four distinct command formats of Scan. The 1st is a
default format which scans for matching lines, not articles.
The 2nd format has a "-a" in it and scans for articles. The
3rd has a "-f" in it and also scans for articles, but it gets
most of its options from a configuration file. The 4th is used
to print the version or more help information.
An article separator containing line feeds can be specified
by adding a "\n". Note that some command line interpreters
require an extra backslash, "\\n". For example, to specify a blank
line as an article separator with csh, use "-s\\n\\n".
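For readers wondering what happens to a string such as "\n\n" once it reaches the program, the Python sketch below decodes the three escape forms listed under pattern matching: \xYY as hex, \YYY as a three digit decimal number, and \Y as a standard C escape. It is an assumption-laden illustration, not Scan's parser: the function name, the set of single character escapes supported, and the choice to pass unknown escapes through unchanged are all made up for this example.

```python
import re

# Single-character C escapes handled by this sketch (an assumed subset).
_SIMPLE = {"n": "\n", "t": "\t", "r": "\r", "a": "\a",
           "b": "\b", "f": "\f", "v": "\v", "\\": "\\"}

# Try \xYY first, then three decimal digits, then any single character.
_ESCAPE = re.compile(r"\\(x[0-9A-Fa-f]{2}|[0-9]{3}|.)", re.DOTALL)


def decode_escapes(s):
    """Replace backslash escapes in s with the characters they denote."""
    def repl(match):
        body = match.group(1)
        if len(body) == 3 and body[0] == "x":
            return chr(int(body[1:], 16))   # \xYY: hex character code
        if len(body) == 3 and body.isdigit():
            return chr(int(body))           # \YYY: decimal character code
        return _SIMPLE.get(body, body)      # \Y: C escape, else literal
    return _ESCAPE.sub(repl, s)
```

A shell that preprocesses backslashes would turn "\\n\\n" into "\n\n" before the program sees it, after which a decoder like this one yields two line feed characters.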
Command line options can be grouped together with the exception
of those having arguments. An option with an argument like "-l"
must appear with a separate dash or be the last option in a
dash group.
Some CLIs, along with Scan, limit the command line to 255
characters. When doing wildcard file matching on the command
line, large numbers of matching files can cause a command line
overrun. This is not harmful, but some of the files you expected
to be scanned won't be. The solution is to put quotes around
the wildcards.
The filename pattern matching algorithm is a lot faster than the
one in "csh" and probably other shells. Because of this, it is
a lot faster to do:
scan "dh0:work/*" xyz
than to do:
scan dh0:work/* xyz
especially if there are a lot of files in the directory.
Author: Walter Rothe
Contact: BIX - aimania OR 2008 Mary St, Carrollton, Tx 75006